1.- One-dimensional Partial Dependence Plot.

The partial dependence plot shows the marginal effect of a feature on the predicted outcome of a previously fit model.
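
As a minimal sketch of what this marginal effect means computationally: for each grid value of the feature, set that feature to the value in every row, predict, and average the predictions. The built-in mtcars data and the linear model are stand-ins (neither is part of this exercise), and manual_pdp is a hypothetical helper name; pdp::partial() carries out, in essence, the same averaging for the random forest used in this practice.

```r
# Manual one-dimensional partial dependence (illustrative sketch).
# mtcars and lm() are stand-ins; manual_pdp is a hypothetical helper name.
manual_pdp <- function(model, data, feature, grid) {
  yhat <- sapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value              # force the feature to one grid value
    mean(predict(model, newdata = modified))  # average over all other features
  })
  out <- data.frame(grid, yhat)
  names(out) <- c(feature, "yhat")
  out
}

fit <- lm(mpg ~ wt + hp + I(hp^2), data = mtcars)
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 20)
pd_hp <- manual_pdp(fit, mtcars, "hp", hp_grid)
head(pd_hp)
plot(pd_hp$hp, pd_hp$yhat, type = "l",
     xlab = "hp", ylab = "Average predicted mpg")
```

The resulting table has one row per grid value and a yhat column, the same layout used in the plots of this section.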

EXERCISE:

Apply PDP to the regression example of predicting bike rentals. Fit a random forest approximation for the prediction of bike rentals (cnt). Use the partial dependence plot to visualize the relationships the model learned. Use the slides shown in class as model.

library(randomForestSRC)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(plotly)
library(reshape2)
library(lubridate)
library(pdp)
library(randomForest)
library(cowplot)


days <- read.csv("/Users/albitadubonllamas/Desktop/Practica5EDM/day.csv", row.names = 1)
hour <- read.csv("/Users/albitadubonllamas/Desktop/Practica5EDM/hour.csv")

Database preparation. Variable adjustment:

# Parse the date column (as_date() comes from lubridate)
days$dteday <- as_date(days$dteday)

# Dummy variables for season
data <- days %>% 
  mutate(spring = ifelse(season == 1, 1, 0),
         summer = ifelse(season == 2, 1, 0),
         fall = ifelse(season == 3, 1, 0))
# Dummy variables for weather situation
data <- data %>% 
  mutate(MISTY = ifelse(weathersit == 2, 1, 0),
         RAIN = ifelse(weathersit %in% c(3,4), 1, 0))

# Rescale the normalised weather variables
data$temp <- days$temp * 39 + (-8)
data$hum <- days$hum * 100
data$windspeed <- days$windspeed * 67
# Number of days elapsed since 2011-01-01
start_date <- as.Date("2011-01-01")
data$days_since_2011 <- as.numeric(as.Date(days$dteday) - start_date)
# Keep only the predictors used by the model, plus the target cnt
data <- data[, c("holiday", "workingday", "summer", "MISTY", "RAIN", "temp", "fall", "hum", "windspeed", "days_since_2011", "cnt")]
rf <- randomForest(cnt~., data=data)
rf
## 
## Call:
##  randomForest(formula = cnt ~ ., data = data) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 432449.1
##                     % Var explained: 88.46
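
The two summary figures in this output are linked. As a rough worked check, assuming (as the randomForest documentation describes its pseudo R-squared) that the percentage of variance explained is 100 * (1 - MSE / Var(cnt)), the reported values imply the following error scale in rentals per day:

```r
# Values copied from the printed model summary above.
mse <- 432449.1            # mean of squared residuals
pct_var_explained <- 88.46

rmse <- sqrt(mse)          # typical prediction error, in rentals per day

# Assumption: % Var explained = 100 * (1 - MSE / Var(cnt))
implied_var <- mse / (1 - pct_var_explained / 100)
implied_sd <- sqrt(implied_var)

round(c(rmse = rmse, implied_sd_cnt = implied_sd), 1)
```

An error of roughly 660 rentals per day against a spread of roughly 1,900 is what an explained variance near 88% corresponds to under that assumption.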

DAYS SINCE 2011

# Partial dependence of the prediction on days_since_2011 (returned as a data frame)
pd_days <- pdp::partial(rf, pred.var = "days_since_2011", plot = FALSE)
plot_ly(pd_days, x = ~days_since_2011, y = ~yhat, type = "scatter", mode = "lines")

HUMIDITY

# Partial dependence of the prediction on humidity
pd_hum <- pdp::partial(rf, pred.var = "hum", plot = FALSE)
plot_ly(pd_hum, x = ~hum, y = ~yhat, type = "scatter", mode = "lines")

TEMPERATURE

# Partial dependence of the prediction on temperature
pd_temp <- pdp::partial(rf, pred.var = "temp", plot = FALSE)
plot_ly(pd_temp, x = ~temp, y = ~yhat, type = "scatter", mode = "lines")

WIND SPEED

# Partial dependence of the prediction on wind speed
pd_wind <- pdp::partial(rf, pred.var = "windspeed", plot = FALSE)
plot_ly(pd_wind, x = ~windspeed, y = ~yhat, type = "scatter", mode = "lines")

QUESTION:

Analyse the influence of days since 2011, temperature, humidity and wind speed on the predicted bike counts.

In the partial dependence plots for temperature, humidity, and wind speed, predicted usage hovers around 4,600 rentals per day. Once humidity exceeds 60%, however, predicted usage drops considerably, reaching as low as 3,600 rentals per day. The highest usage occurs when humidity is between 40% and 60%.

For temperature, predicted usage rises as the temperature rises, which indicates a positive relationship between the two. Above about 20 degrees, however, the trend reverses and predicted usage starts to decline.

Wind speed shows a negative relationship: the higher the wind speed, the lower the predicted usage. Usage is highest at around 5 km/h, where it reaches almost 4,700 rentals per day.

Finally, the curve for days since 2011 is more uneven, with several rises and dips. At first glance this could suggest that the date has no clear relation to bicycle rentals. However, although there are periods in which demand dips slightly, the overall trend is upward. One possible reading is that people have become more aware of sustainable transport and its usefulness, which would translate into more bicycle rentals.

Taking these results together, predicted bicycle rentals depend strongly on environmental conditions: extreme temperatures, high humidity, and strong winds make cycling unpleasant and reduce usage. The upward trend over time also suggests that demand for rentals may keep growing, although this extrapolates beyond the period covered by the data.
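
Readings such as "the highest usage" or "as low as 3,600" can be taken from the partial dependence table instead of by eye. A minimal sketch, assuming a data frame shaped like the output of pdp::partial() (one feature column plus yhat); the values in toy_pd are made up for illustration and summarise_pd is a hypothetical helper name:

```r
# Summarise one partial dependence curve: where it peaks, where it bottoms
# out, and how much the average prediction moves across the grid.
summarise_pd <- function(pd, feature) {
  data.frame(
    feature    = feature,
    x_at_max   = pd[[feature]][which.max(pd$yhat)],
    yhat_max   = max(pd$yhat),
    x_at_min   = pd[[feature]][which.min(pd$yhat)],
    yhat_min   = min(pd$yhat),
    yhat_range = diff(range(pd$yhat))
  )
}

# Made-up values, only to show the shape of the result
toy_pd <- data.frame(hum  = c(20, 40, 60, 80, 100),
                     yhat = c(4500, 4600, 4550, 4000, 3600))
summarise_pd(toy_pd, "hum")
```

Comparing yhat_range across features gives a rough sense of which feature moves the average prediction the most.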


2.- Two-dimensional Partial Dependence Plot.

EXERCISE:

Generate a 2D Partial Dependency Plot with humidity and temperature to predict the number of bikes rented depending on those parameters.

BE CAREFUL: due to the size, extract a set of random samples from the database before generating the data for the Partial Dependency Plot.
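
A minimal sketch of that sampling step in base R, with a fixed seed so that the same rows (and therefore the same plot) are obtained on every run. The seed value, the sample size, and the use of the built-in mtcars data are illustrative assumptions:

```r
set.seed(42)                                 # assumed seed; any fixed value works
n_keep <- 10                                 # assumed sample size for the toy data
rows <- sample(nrow(mtcars), size = n_keep)  # row indices, drawn without replacement
mtcars_sample <- mtcars[rows, ]
dim(mtcars_sample)
```

dplyr::sample_n() draws rows from the same random number generator, so calling set.seed() before it makes that step reproducible as well.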

Show the density distribution of both input features with the 2D plot as shown in the class slides.

TIP: Use geom_tile() to generate the 2D plot. Set width and height to avoid holes.

sample <- data %>% sample_n(size = 400, replace = FALSE)
rf_sample <- randomForest(cnt ~ ., data = sample)

pdp::partial(rf_sample, pred.var = c("temp", "hum"), train = sample, grid.resolution = 20, plot = TRUE)

# Density distributions for input features

temp_plot <- ggplot(sample, aes(x = temp)) + 
  geom_density(fill = "blue", alpha = 0.5) +
  labs(x = "Temperature", y = "Density") +
  theme_bw()

hum_plot <- ggplot(sample, aes(x = hum)) + 
  geom_density(fill = "blue", alpha = 0.5) +
  labs(x = "Humidity", y = "Density") +
  theme_bw()



plot_grid(temp_plot, hum_plot, ncol = 1, nrow = 2)
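
Following the tip in the exercise, the same surface can also be drawn with geom_tile(). This sketch assumes a grid data frame with columns temp, hum and yhat, which is the shape pdp::partial() returns when plot = FALSE, and it assumes the grid is evenly spaced. Here the grid is filled with a made-up smooth function so the block runs on its own. Setting width and height to the grid spacing is what prevents holes between tiles.

```r
library(ggplot2)

# Stand-in for: pd <- pdp::partial(model, pred.var = c("temp", "hum"), plot = FALSE)
temp_grid <- seq(-5, 30, length.out = 20)
hum_grid  <- seq(20, 100, length.out = 20)
pd <- expand.grid(temp = temp_grid, hum = hum_grid)
pd$yhat <- 4500 + 60 * pd$temp - 10 * pmax(pd$hum - 60, 0)  # made-up surface

# Tile size = spacing between neighbouring grid values on each axis
tile_w <- diff(temp_grid)[1]
tile_h <- diff(hum_grid)[1]

heatmap <- ggplot(pd, aes(x = temp, y = hum, fill = yhat)) +
  geom_tile(width = tile_w, height = tile_h) +
  labs(x = "Temperature", y = "Humidity", fill = "Predicted cnt") +
  theme_bw()
heatmap
```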

QUESTION:

Interpret the results:

The heat map shows how temperature and humidity jointly influence the predicted number of rented bicycles. First, regardless of humidity, rentals depend clearly on temperature: there are very few rentals at the lowest temperatures (down to about -5), and rentals increase as the temperature rises. Beyond a temperature of about 25, however, rentals decrease slightly, which suggests that demand may start to fall at very high temperatures.

The influence of humidity is less pronounced, but the same pattern repeats across the whole heat map: higher humidity is associated with fewer rentals, largely independently of the temperature.

Additionally, we generated two density plots, which show how the observations of each input feature are distributed. The temperature density rises at first and then remains relatively constant from about 4 to 20, so this is the range where observations are most common. The density describes the data, not the response: it indicates that the heat map is best supported by observations in this temperature range, which is also where the increase in rentals with temperature is observed.

Lastly, the humidity density rises from 0 to approximately 50 and decreases beyond that value, so most observations have humidity values around 50. This describes how often each humidity level occurs, not the number of rentals, so the density does not by itself show a relationship between humidity and rentals. What it does show is that very low and very high humidity values are comparatively rare, so the decrease in rentals at high humidity seen in the heat map rests on fewer observations than the central part of the map.
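
To make "where observations are more common" concrete, the central range of each input can be computed directly. A minimal sketch, assuming that the 5% and 95% quantiles are an acceptable definition of the well-supported region (an illustrative choice) and using simulated values in place of the sampled data:

```r
set.seed(1)
# Simulated stand-ins for the temp and hum columns of the sampled data
toy <- data.frame(temp = runif(400, -5, 32),
                  hum  = pmin(pmax(rnorm(400, 60, 15), 0), 100))

# Range containing the central 90% of observations for each feature
support <- sapply(toy, quantile, probs = c(0.05, 0.95))
support

# Flag grid points of a 2D partial dependence grid that fall outside that range
grid <- expand.grid(temp = seq(-5, 32, length.out = 5),
                    hum  = seq(0, 100, length.out = 5))
grid$well_supported <-
  grid$temp >= support["5%", "temp"] & grid$temp <= support["95%", "temp"] &
  grid$hum  >= support["5%", "hum"]  & grid$hum  <= support["95%", "hum"]
table(grid$well_supported)
```

Cells flagged as poorly supported are the ones where the heat map should be read with the most caution.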


3.- PDP to explain the price of a house.

EXERCISE:

Apply the previous concepts to predict the price of a house from the database kc_house_data.csv. In this case, use again a random forest approximation for the prediction based on the features bedrooms, bathrooms, sqft_living, sqft_lot, floors and yr_built. Use the partial dependence plot to visualize the relationships the model learned.

BE CAREFUL: due to the size, extract a set of random samples from the database before generating the data for the Partial Dependency Plot.

house <- read.csv("/Users/albitadubonllamas/Desktop/Practica5EDM/kc_house_data.csv")
house_sample <- house %>% sample_n(size = 10000, replace = FALSE)
house_data_rf <- house_sample[, c('price',"bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "yr_built")]

rf_house <- randomForest(price~., data=house_data_rf)
pdp_bed <- pdp::partial(rf_house, pred.var = "bedrooms", plot = TRUE, plot.engine = "ggplot2") + ylab("Price")
pdp_bath <- pdp::partial(rf_house, pred.var = "bathrooms", plot = TRUE, plot.engine = "ggplot2") + ylab("Price") + scale_y_continuous(labels = scales::number_format(scale = 1, big.mark = ""))
pdp_sqft <- pdp::partial(rf_house, pred.var = "sqft_living", plot = TRUE,plot.engine = "ggplot2") + ylab("Price")
pdp_floors <- pdp::partial(rf_house, pred.var = "floors", plot = TRUE, plot.engine = "ggplot2") + ylab("Price")
pdp_yr_built <- pdp::partial(rf_house, pred.var = "yr_built", plot = TRUE, plot.engine = "ggplot2") + ylab("Price")
pdp_sqft_lot <- pdp::partial(rf_house, pred.var = "sqft_lot", plot = TRUE, plot.engine = "ggplot2") + ylab("Price")

p1 <- ggplot(pdp_bed$data, aes(x = bedrooms, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for Bedrooms", x = "Bedrooms", y = "Predicted Value")

p2 <- ggplot(pdp_bath$data, aes(x = bathrooms, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for Bathrooms", x = "Bathrooms", y = "Predicted Value")

p3 <- ggplot(pdp_sqft$data, aes(x = sqft_living, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for Square Footage", x = "Sqft_living", y = "Predicted Value")

p4 <- ggplot(pdp_floors$data, aes(x = floors, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for Number of Floors", x = "Floors", y = "Predicted Value")

p5 <- ggplot(pdp_yr_built$data, aes(x = yr_built, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for Year Built", x = "Year Built", y = "Predicted Value")

p6 <- ggplot(pdp_sqft_lot$data, aes(x = sqft_lot, y = yhat)) + 
      geom_line(color = "blue") +
      labs(title = "PDP for sqft_lot", x = "sqft_lot", y = "Predicted Value")

plot_grid(p1, p2, p3, p4,p5,p6, ncol = 3, nrow = 2)
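
The six plotting blocks differ only in the column name and the labels, so they could be generated by one helper. A sketch under assumptions: plot_pd is a hypothetical name, and the toy data frames (with made-up values) stand in for the partial dependence tables, each holding one feature column plus yhat.

```r
library(ggplot2)

# Build one PDP line plot; the same label feeds both the title and the x axis.
plot_pd <- function(pd, feature, label = feature) {
  ggplot(pd, aes(x = .data[[feature]], y = .data[["yhat"]])) +
    geom_line(color = "blue") +
    labs(title = paste("PDP for", label), x = label, y = "Predicted Value")
}

# Toy tables with made-up values, only to exercise the helper
toy_tables <- list(
  bedrooms  = data.frame(bedrooms = 1:6, yhat = c(52, 53, 54, 53, 52, 51) * 1e4),
  bathrooms = data.frame(bathrooms = 1:4, yhat = c(45, 50, 58, 70) * 1e4)
)
labels <- c(bedrooms = "Bedrooms", bathrooms = "Bathrooms")

plots <- lapply(names(toy_tables),
                function(f) plot_pd(toy_tables[[f]], f, labels[[f]]))
names(plots) <- names(toy_tables)
plots$bathrooms
```

Because the title and the axis label come from a single lookup, they cannot drift apart. The resulting list can be passed to cowplot::plot_grid() through its plotlist argument.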

QUESTION:

Analyse the influence of bedrooms, bathrooms, sqft_living and floors on the predicted price.

In the obtained graphs, we can observe how different features of a house affect its sale price.

For the number of bedrooms, the curve descends overall, so houses with fewer bedrooms receive higher predicted prices. There is a peak around 3 bedrooms and a marked drop from approximately 7 bedrooms onward. This result is counterintuitive, because more bedrooms usually means more space, and one would expect a larger house to be more expensive. A possible explanation is that the partial dependence changes the number of bedrooms while sqft_living keeps its observed values, so additional bedrooms imply smaller rooms and not a larger house.

In contrast, the curves for the number of bathrooms and the number of floors behave like the curve for living area: as each of these features increases, the predicted price tends to increase as well. This matches the expectation that more space means a higher price.

Lastly, the year of construction and the lot size, although they yield different curves, follow a similar pattern. In both cases the predicted price is higher at the start of the x-axis (an older house, a smaller lot), decreases, and then increases again. This also appears counterintuitive. For the year of construction, one would expect newer houses to be more expensive, since they tend to have more modern features and be in better condition than older ones. Similarly, as with living area, a larger lot would be expected to command a higher price, since it offers more space and more potential for development and use.

Considering all the results, some of them are inconclusive. What we can assert is that more space corresponds to a higher predicted price, which is clearly visible in the plots for bathrooms, floors, and living area.